Introduction

From an outsider’s perspective, the Airbnb platform has revolutionized the renting of rooms much as Uber has revolutionized ridesharing. Airbnb allows users of the platform to monetize property that they otherwise could not. To determine whether renting out a room is a good business case, or worth the owner’s time, it would help to have an estimate of how much income the room would generate. Our proposal is to develop a model, trained on the London weekend Airbnb data, that predicts room rental income from predictors known to the person listing the room. The data is available on Kaggle at https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities, and from the original authors, Kristóf Gyódi and Łukasz Nawaro, at https://zenodo.org/record/4446043#.Y9Y9ENJBwUE.

The columns of the dataset are described in the structure output below.

Load the data from a .csv file.

london = read.csv("london_weekends.csv")
london = subset(london, select = -X)

What are the dimensions of the dataset?

nrow(london)
## [1] 5379
ncol(london)
## [1] 19

There are 5379 rows of data with 19 variables.

Methods

Exploratory Data Analysis (EDA)

Inspect Data

head(london, 5)
##    realSum       room_type room_shared room_private person_capacity
## 1 121.1223    Private room       False         True               2
## 2 195.9124    Private room       False         True               2
## 3 193.3253    Private room       False         True               3
## 4 180.3899    Private room       False         True               2
## 5 405.7010 Entire home/apt       False        False               3
##   host_is_superhost multi biz cleanliness_rating guest_satisfaction_overall
## 1             False     0   0                  6                         69
## 2             False     1   0                 10                         96
## 3             False     1   0                 10                         95
## 4             False     1   0                  9                         87
## 5             False     0   1                  7                         65
##   bedrooms     dist metro_dist attr_index attr_index_norm rest_index
## 1        1 5.734117  0.4370940   222.8822        15.49341   470.0885
## 2        1 4.788905  1.4640505   235.3858        16.36259   530.1335
## 3        1 4.596677  0.4503062   268.9138        18.69325   548.9876
## 4        1 2.054769  0.1326705   472.3813        32.83707  1021.2711
## 5        0 4.491277  0.3541075   318.4915        22.13958   692.7754
##   rest_index_norm      lng      lat
## 1        8.413765 -0.04975 51.52570
## 2        9.488466 -0.08475 51.54210
## 3        9.825922 -0.14585 51.54802
## 4       18.278973 -0.10611 51.52108
## 5       12.399473 -0.18797 51.49399

The structure of the dataset

str(london)
## 'data.frame':    5379 obs. of  19 variables:
##  $ realSum                   : num  121 196 193 180 406 ...
##  $ room_type                 : chr  "Private room" "Private room" "Private room" "Private room" ...
##  $ room_shared               : chr  "False" "False" "False" "False" ...
##  $ room_private              : chr  "True" "True" "True" "True" ...
##  $ person_capacity           : num  2 2 3 2 3 2 2 2 4 2 ...
##  $ host_is_superhost         : chr  "False" "False" "False" "False" ...
##  $ multi                     : int  0 1 1 1 0 0 0 0 0 1 ...
##  $ biz                       : int  0 0 0 0 1 1 1 1 1 0 ...
##  $ cleanliness_rating        : num  6 10 10 9 7 9 10 9 9 10 ...
##  $ guest_satisfaction_overall: num  69 96 95 87 65 93 97 88 87 97 ...
##  $ bedrooms                  : int  1 1 1 1 0 0 1 1 1 1 ...
##  $ dist                      : num  5.73 4.79 4.6 2.05 4.49 ...
##  $ metro_dist                : num  0.437 1.464 0.45 0.133 0.354 ...
##  $ attr_index                : num  223 235 269 472 318 ...
##  $ attr_index_norm           : num  15.5 16.4 18.7 32.8 22.1 ...
##  $ rest_index                : num  470 530 549 1021 693 ...
##  $ rest_index_norm           : num  8.41 9.49 9.83 18.28 12.4 ...
##  $ lng                       : num  -0.0498 -0.0848 -0.1459 -0.1061 -0.188 ...
##  $ lat                       : num  51.5 51.5 51.5 51.5 51.5 ...

The summary statistics

summary(london)
##     realSum          room_type         room_shared        room_private      
##  Min.   :   54.33   Length:5379        Length:5379        Length:5379       
##  1st Qu.:  174.51   Class :character   Class :character   Class :character  
##  Median :  268.12   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :  364.39                                                           
##  3rd Qu.:  438.27                                                           
##  Max.   :12937.27                                                           
##  person_capacity host_is_superhost      multi             biz        
##  Min.   :2.000   Length:5379        Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:2.000   Class :character   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :2.000   Mode  :character   Median :0.0000   Median :0.0000  
##  Mean   :2.858                      Mean   :0.2798   Mean   :0.3579  
##  3rd Qu.:4.000                      3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :6.000                      Max.   :1.0000   Max.   :1.0000  
##  cleanliness_rating guest_satisfaction_overall    bedrooms    
##  Min.   : 2.000     Min.   : 20.00             Min.   :0.000  
##  1st Qu.: 9.000     1st Qu.: 87.00             1st Qu.:1.000  
##  Median :10.000     Median : 94.00             Median :1.000  
##  Mean   : 9.194     Mean   : 90.92             Mean   :1.133  
##  3rd Qu.:10.000     3rd Qu.: 99.00             3rd Qu.:1.000  
##  Max.   :10.000     Max.   :100.00             Max.   :8.000  
##       dist            metro_dist        attr_index      attr_index_norm  
##  Min.   : 0.04056   Min.   :0.01388   Min.   :  68.74   Min.   :  4.778  
##  1st Qu.: 3.54568   1st Qu.:0.32404   1st Qu.: 177.22   1st Qu.: 12.320  
##  Median : 4.93914   Median :0.53613   Median : 247.65   Median : 17.215  
##  Mean   : 5.32762   Mean   :1.01653   Mean   : 294.58   Mean   : 20.477  
##  3rd Qu.: 6.83807   3rd Qu.:1.09076   3rd Qu.: 361.07   3rd Qu.: 25.099  
##  Max.   :17.32120   Max.   :9.17409   Max.   :1438.56   Max.   :100.000  
##    rest_index     rest_index_norm        lng                lat       
##  Min.   : 140.5   Min.   :  2.515   Min.   :-0.25170   Min.   :51.41  
##  1st Qu.: 382.1   1st Qu.:  6.839   1st Qu.:-0.16996   1st Qu.:51.49  
##  Median : 527.3   Median :  9.439   Median :-0.11813   Median :51.51  
##  Mean   : 625.6   Mean   : 11.197   Mean   :-0.11478   Mean   :51.50  
##  3rd Qu.: 764.2   3rd Qu.: 13.678   3rd Qu.:-0.06772   3rd Qu.:51.53  
##  Max.   :5587.1   Max.   :100.000   Max.   : 0.12018   Max.   :51.58

Handle Missing Data

colSums(is.na(london))
##                    realSum                  room_type 
##                          0                          0 
##                room_shared               room_private 
##                          0                          0 
##            person_capacity          host_is_superhost 
##                          0                          0 
##                      multi                        biz 
##                          0                          0 
##         cleanliness_rating guest_satisfaction_overall 
##                          0                          0 
##                   bedrooms                       dist 
##                          0                          0 
##                 metro_dist                 attr_index 
##                          0                          0 
##            attr_index_norm                 rest_index 
##                          0                          0 
##            rest_index_norm                        lng 
##                          0                          0 
##                        lat 
##                          0

Numerical/Categorical variables

target = 'realSum'
numerical_vars = c('dist', 'metro_dist', 'attr_index', 'attr_index_norm',
                   'rest_index', 'rest_index_norm', 'lng', 'lat',
                   'cleanliness_rating', 'guest_satisfaction_overall')
categorical_vars = names(london)[!names(london) %in% union(numerical_vars, target)]

Identify Outliers

# Interquartile Range (IQR) method by default. For "iqr", threshold is the
# IQR multiplier (1.5 is conventional); for "z-score", it is the z cutoff
# (2 or 3 is typical).
get_outliers_idx = function(data, method = "iqr", threshold = 1.5) {
  if (method == "iqr") {
    Q1 = quantile(data, 0.25)
    Q3 = quantile(data, 0.75)
    
    iqr = Q3 - Q1
    
    lower_bound = Q1 - threshold * iqr
    upper_bound = Q3 + threshold * iqr
    
    idx = which(data < lower_bound | data > upper_bound)
  }
  else if (method == "z-score") {
    z_scores = abs(scale(data))
    # Points beyond the cutoff are the outliers
    idx = which(z_scores > threshold)
  }
  else {
    stop("method not found")
  }
    
  return(idx)
}
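To see the IQR rule in action, here is a toy check (the values are made up, not from the dataset) in which a single extreme point is flagged:

```r
x = c(1:10, 100)  # one clear outlier

Q1 = quantile(x, 0.25)   # 3.5
Q3 = quantile(x, 0.75)   # 8.5
iqr = Q3 - Q1            # 5

# Any point outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] = [-4, 16] is flagged
which(x < Q1 - 1.5 * iqr | x > Q3 + 1.5 * iqr)
# Flags only element 11 (the value 100)
```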

Remove outliers

idx = c()
for (col in c(numerical_vars, target)) {
  idx = union(idx, get_outliers_idx(london[[col]]))
}

# Guard against idx being empty, which would otherwise drop every row
if (length(idx) > 0) {
  london = london[-idx, ]
}

Data Visualization: numerical variables

# par(mfrow = c(1, length(numerical_vars)))
for (col in numerical_vars) {
  hist(london[[col]], main = col, xlab = "Value", ylab = "Frequency", col = "steelblue")
}

The relationship between numerical variables

Pairplots

pairs(london[numerical_vars], col = "steelblue")

Correlation

round(cor(london[numerical_vars]), 2)
##                             dist metro_dist attr_index attr_index_norm
## dist                        1.00       0.38      -0.90           -0.90
## metro_dist                  0.38       1.00      -0.41           -0.41
## attr_index                 -0.90      -0.41       1.00            1.00
## attr_index_norm            -0.90      -0.41       1.00            1.00
## rest_index                 -0.89      -0.46       0.93            0.93
## rest_index_norm            -0.89      -0.46       0.93            0.93
## lng                         0.02       0.27       0.05            0.05
## lat                        -0.27      -0.17       0.17            0.17
## cleanliness_rating          0.09       0.11      -0.09           -0.09
## guest_satisfaction_overall  0.13       0.18      -0.15           -0.15
##                            rest_index rest_index_norm   lng   lat
## dist                            -0.89           -0.89  0.02 -0.27
## metro_dist                      -0.46           -0.46  0.27 -0.17
## attr_index                       0.93            0.93  0.05  0.17
## attr_index_norm                  0.93            0.93  0.05  0.17
## rest_index                       1.00            1.00 -0.02  0.26
## rest_index_norm                  1.00            1.00 -0.02  0.26
## lng                             -0.02           -0.02  1.00  0.27
## lat                              0.26            0.26  0.27  1.00
## cleanliness_rating              -0.10           -0.10 -0.04 -0.06
## guest_satisfaction_overall      -0.16           -0.16 -0.02 -0.04
##                            cleanliness_rating guest_satisfaction_overall
## dist                                     0.09                       0.13
## metro_dist                               0.11                       0.18
## attr_index                              -0.09                      -0.15
## attr_index_norm                         -0.09                      -0.15
## rest_index                              -0.10                      -0.16
## rest_index_norm                         -0.10                      -0.16
## lng                                     -0.04                      -0.02
## lat                                     -0.06                      -0.04
## cleanliness_rating                       1.00                       0.67
## guest_satisfaction_overall               0.67                       1.00
library(corrplot)
## corrplot 0.92 loaded
cor_mx = cor(london[numerical_vars])

corrplot(cor_mx, method = "color")

Data Visualization: Categorical variables

for (col in categorical_vars) {
  boxplot(reformulate(col, response = "realSum"), data = london,
        main = paste0("realSum vs ", col),
        xlab = col,
        ylab = "Real Sum",
        col = "steelblue")
}

Unique values

for (cat_var in categorical_vars) {
  print(unique(london[[cat_var]]))
}
## [1] "Private room"    "Entire home/apt" "Shared room"    
## [1] "False" "True" 
## [1] "True"  "False"
## [1] 2 3 4 6 5
## [1] "False" "True" 
## [1] 1 0
## [1] 0 1
## [1] 1 0 2 3 5 4

Transform to factor variables

london[categorical_vars] = lapply(london[categorical_vars], as.factor)

Encoding categorical variables

london$room_shared = ifelse(london$room_shared == "True", 1, 0)
london$room_private = ifelse(london$room_private == "True", 1, 0)
london$host_is_superhost = ifelse(london$host_is_superhost == "True", 1, 0)
# Factor levels are coded alphabetically:
# "Entire home/apt" = 1, "Private room" = 2, "Shared room" = 3
london$room_type = as.numeric(london$room_type)

Exclude collinear predictors

cor(london$attr_index, london$attr_index_norm)
## [1] 1
cor(london$rest_index, london$rest_index_norm)
## [1] 1
london = subset(london, select = -c(attr_index, rest_index, room_shared, room_private))

The distribution of the target variable

num_bins = 15

bin_width = (max(london$realSum) - min(london$realSum)) / num_bins

breakpoints = seq(min(london$realSum), max(london$realSum), by = bin_width)

hist(london$realSum[london$realSum < max(london$realSum)],
     main = "Distribution of realSum",
     xlab = "realSum",
     ylab = "Frequency",
     breaks = breakpoints,
     col = "steelblue")

Main Effects Analysis

Split data into train/test

# Set a seed first (e.g. with set.seed) if a reproducible split is needed
idx = sample(1:nrow(london), floor(0.8 * nrow(london)))

train_data = london[idx, ]
test_data = london[-idx, ]

target = "realSum"

X_train = train_data[, !(names(train_data) %in% target)]
y_train = train_data[[target]]
X_test = test_data[, !(names(test_data) %in% target)]
y_test = test_data[[target]]

Function to calculate RMSE

calc_rmse = function(actual_values, predicted_values) {
  rmse = sqrt(mean((actual_values - predicted_values) ^ 2))
  
  return(rmse)
}
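As a toy check of the formula (the numbers are made up): predictions that are uniformly off by 2 give an RMSE of exactly 2.

```r
actual    = c(10, 20, 30)
predicted = c(12, 22, 32)  # every prediction off by +2

sqrt(mean((actual - predicted)^2))  # = 2
```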

Performs diagnostic calculations for linear models

diag = function(model, name){
  # In-sample metrics, computed on the full dataset
  # (note: this definition masks base::diag for the rest of the session)
  resid = fitted(model) - london$realSum
  rmse = sqrt(sum(resid^2)/nrow(london))
  r2 = summary(model)$r.squared
  coefs = nrow(summary(model)$coeff)
  data.frame(model = name, rmse = rmse, r2 = r2, coefficients = coefs)
}

Fit a full additive model

full_additive_mod = lm(realSum ~ ., data = london)
summary(full_additive_mod)
## 
## Call:
## lm(formula = realSum ~ ., data = london)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -295.34  -58.97  -12.87   40.07  633.74 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2079.4323  3025.2963   0.687 0.491904    
## room_type                  -156.8349     3.8600 -40.631  < 2e-16 ***
## person_capacity3             22.6491     5.4863   4.128 3.73e-05 ***
## person_capacity4             62.1017     4.9795  12.472  < 2e-16 ***
## person_capacity5             70.2959     9.7317   7.223 6.06e-13 ***
## person_capacity6            121.6284     8.8126  13.802  < 2e-16 ***
## host_is_superhost             4.8053     4.1339   1.162 0.245140    
## multi1                        8.8460     3.8212   2.315 0.020667 *  
## biz1                         19.7699     4.1121   4.808 1.58e-06 ***
## cleanliness_rating           23.1461     2.7877   8.303  < 2e-16 ***
## guest_satisfaction_overall    0.7166     0.2979   2.406 0.016186 *  
## bedrooms1                    53.7163     6.0696   8.850  < 2e-16 ***
## bedrooms2                   155.9215     8.1215  19.199  < 2e-16 ***
## bedrooms3                   182.8555    15.0762  12.129  < 2e-16 ***
## bedrooms4                   110.3803    36.0825   3.059 0.002235 ** 
## bedrooms5                    61.4314    93.7029   0.656 0.512121    
## dist                          6.7243     1.9839   3.389 0.000707 ***
## metro_dist                  -12.2260     3.6522  -3.348 0.000823 ***
## attr_index_norm               1.4684     0.6252   2.349 0.018892 *  
## rest_index_norm               9.9394     1.1618   8.555  < 2e-16 ***
## lng                        -156.2512    29.0213  -5.384 7.71e-08 ***
## lat                         -40.4128    58.6179  -0.689 0.490595    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93.43 on 3920 degrees of freedom
## Multiple R-squared:  0.6757, Adjusted R-squared:  0.674 
## F-statistic:   389 on 21 and 3920 DF,  p-value: < 2.2e-16
linear_exp_results = diag(full_additive_mod, "full_additive_mod")
linear_exp_results
##               model     rmse       r2 coefficients
## 1 full_additive_mod 93.16959 0.675745           22

Fitting an additive model with no interactions to the data yields an \(R^2 = 0.67574\). We will attempt to improve on this.

We will use AIC to determine if any of the predictors can be dropped.

aic_full_additive_mod = step(full_additive_mod, direction = "backward", trace = FALSE)
summary(aic_full_additive_mod)
## 
## Call:
## lm(formula = realSum ~ room_type + person_capacity + multi + 
##     biz + cleanliness_rating + guest_satisfaction_overall + bedrooms + 
##     dist + metro_dist + attr_index_norm + rest_index_norm + lng, 
##     data = london)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -295.58  -58.80  -13.08   40.54  634.62 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -12.4669    31.9898  -0.390 0.696768    
## room_type                  -156.4758     3.8493 -40.651  < 2e-16 ***
## person_capacity3             22.2907     5.4796   4.068 4.84e-05 ***
## person_capacity4             61.8693     4.9763  12.433  < 2e-16 ***
## person_capacity5             69.9986     9.7284   7.195 7.43e-13 ***
## person_capacity6            121.4798     8.8110  13.787  < 2e-16 ***
## multi1                        9.2028     3.8062   2.418 0.015659 *  
## biz1                         19.7525     4.1108   4.805 1.61e-06 ***
## cleanliness_rating           23.5866     2.7669   8.525  < 2e-16 ***
## guest_satisfaction_overall    0.7458     0.2959   2.521 0.011756 *  
## bedrooms1                    53.7896     6.0692   8.863  < 2e-16 ***
## bedrooms2                   156.0994     8.1202  19.224  < 2e-16 ***
## bedrooms3                   182.9747    15.0753  12.137  < 2e-16 ***
## bedrooms4                   109.9049    36.0759   3.046 0.002331 ** 
## bedrooms5                    60.1752    93.6959   0.642 0.520755    
## dist                          7.0379     1.9240   3.658 0.000258 ***
## metro_dist                  -11.7688     3.5824  -3.285 0.001028 ** 
## attr_index_norm               1.6054     0.5854   2.743 0.006125 ** 
## rest_index_norm               9.7581     1.1283   8.649  < 2e-16 ***
## lng                        -164.8004    26.5292  -6.212 5.78e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93.43 on 3922 degrees of freedom
## Multiple R-squared:  0.6756, Adjusted R-squared:  0.674 
## F-statistic: 429.9 on 19 and 3922 DF,  p-value: < 2.2e-16
row = diag(aic_full_additive_mod, "aic_full_additive_mod")
row
##                   model     rmse        r2 coefficients
## 1 aic_full_additive_mod 93.19119 0.6755946           20
linear_exp_results = rbind(linear_exp_results, row)

AIC eliminates just host_is_superhost and lat. The \(R^2\) of the model left after AIC elimination, 0.6755946, is nearly identical to the full additive model’s 0.67574, and the \(RMSE\) is essentially unchanged.

We will now try BIC to see if this results in a smaller model.

bic_full_additive_mod = step(full_additive_mod, direction = "backward", k = log(nrow(london)), trace = FALSE)
summary(bic_full_additive_mod)
## 
## Call:
## lm(formula = realSum ~ room_type + person_capacity + biz + cleanliness_rating + 
##     bedrooms + metro_dist + rest_index_norm + lng, data = london)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -298.66  -59.69  -12.90   40.73  627.20 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          88.397     23.232   3.805 0.000144 ***
## room_type          -154.393      3.809 -40.538  < 2e-16 ***
## person_capacity3     22.383      5.474   4.089 4.42e-05 ***
## person_capacity4     62.129      4.974  12.490  < 2e-16 ***
## person_capacity5     70.469      9.731   7.242 5.31e-13 ***
## person_capacity6    122.391      8.792  13.920  < 2e-16 ***
## biz1                 13.563      3.513   3.861 0.000115 ***
## cleanliness_rating   27.906      2.172  12.847  < 2e-16 ***
## bedrooms1            52.790      6.056   8.717  < 2e-16 ***
## bedrooms2           154.977      8.105  19.121  < 2e-16 ***
## bedrooms3           182.615     15.063  12.123  < 2e-16 ***
## bedrooms4           105.746     36.140   2.926 0.003453 ** 
## bedrooms5            58.434     93.930   0.622 0.533908    
## metro_dist          -12.342      3.575  -3.453 0.000561 ***
## rest_index_norm       9.665      0.447  21.625  < 2e-16 ***
## lng                -150.363     26.000  -5.783 7.90e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93.69 on 3926 degrees of freedom
## Multiple R-squared:  0.6734, Adjusted R-squared:  0.6722 
## F-statistic: 539.8 on 15 and 3926 DF,  p-value: < 2.2e-16
row = diag(bic_full_additive_mod, "bic_full_additive_model")
row
##                     model     rmse        r2 coefficients
## 1 bic_full_additive_model 93.50005 0.6734407           16
linear_exp_results = rbind(linear_exp_results, row)

BIC, as we would expect, produces a smaller model.

As there is little difference between these models, we will continue the analysis with the full linear model. In summary, no single simple additive model stands out.

Finding the Best Additive Model Using Brute Force

We performed a brute-force exploration of all linear additive models and obtained the model below, selected for the lowest RMSE. The code is not shown as it takes quite some time to run.
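Such a search can be sketched as follows. The small synthetic data frame here is a hypothetical stand-in for the real data, and the `combn`-based loop is illustrative only, not the code actually used: it fits every subset of predictors and keeps the one with the lowest held-out RMSE.

```r
set.seed(1)
# Synthetic stand-in for the data; only x1 truly drives y
n = 200
df = data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y = 2 * df$x1 + rnorm(n)

train = df[1:150, ]
test  = df[151:200, ]
predictors = c("x1", "x2", "x3")

best = list(rmse = Inf, formula = NULL)
for (k in seq_along(predictors)) {
  # All subsets of size k
  for (vars in combn(predictors, k, simplify = FALSE)) {
    fit  = lm(reformulate(vars, response = "y"), data = train)
    rmse = sqrt(mean((test$y - predict(fit, newdata = test))^2))
    if (rmse < best$rmse) best = list(rmse = rmse, formula = formula(fit))
  }
}
best$formula
```

Selecting on held-out rather than in-sample RMSE matters here: in-sample, the full model always wins because adding predictors can never increase the training error.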

brute_force = lm(realSum ~ room_type + person_capacity + multi + biz + bedrooms + metro_dist + rest_index_norm + lat, data = london)
row = diag(brute_force, "brute_force")
row
##         model     rmse        r2 coefficients
## 1 brute_force 95.70173 0.6578804           16
linear_exp_results = rbind(linear_exp_results, row)

Summary of performance for linear models

library(knitr)
kable(linear_exp_results)
model                      rmse       r2         coefficients
full_additive_mod          93.16959   0.6757450  22
aic_full_additive_mod      93.19119   0.6755946  20
bic_full_additive_model    93.50005   0.6734407  16
brute_force                95.70173   0.6578804  16

Collinearity Analysis

Although collinearity likely will not affect prediction, it will affect our ability to perform inference tests.

Let us check for collinearity using the variance inflation factor.

Calculate VIF

library(faraway)
vif(full_additive_mod)
##                  room_type           person_capacity3 
##                   1.712194                   1.113846 
##           person_capacity4           person_capacity5 
##                   1.649836                   1.231677 
##           person_capacity6          host_is_superhost 
##                   1.792595                   1.127032 
##                     multi1                       biz1 
##                   1.335385                   1.738527 
##         cleanliness_rating guest_satisfaction_overall 
##                   1.839646                   2.065436 
##                  bedrooms1                  bedrooms2 
##                   2.830461                   3.179536 
##                  bedrooms3                  bedrooms4 
##                   1.462706                   1.042173 
##                  bedrooms5                       dist 
##                   1.005581                   6.482776 
##                 metro_dist            attr_index_norm 
##                   1.490631                  10.692268 
##            rest_index_norm                        lng 
##                   9.227965                   1.406944 
##                        lat 
##                   1.397655
vif(full_additive_mod)[unname(vif(full_additive_mod)) > 5]
##            dist attr_index_norm rest_index_norm 
##        6.482776       10.692268        9.227965

Per the textbook, predictors with a VIF above 5 are suspicious for collinearity; here those are dist, attr_index_norm, and rest_index_norm. This makes sense: dist is the distance to the city center, and where it is small, the attraction and restaurant indices (attr_index_norm and rest_index_norm) tend to be high, since there are more attractions and restaurants near the center. Likewise, metro_dist, the distance to a metro station, tends to be small when dist is small. As these VIF values are not tremendously larger than 5, we will keep the predictors in the model.
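The VIF itself comes from an auxiliary regression: \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) is obtained by regressing predictor \(j\) on all the others. A minimal sketch on synthetic data (the variables are hypothetical, not from the Airbnb data) reproduces the behaviour:

```r
set.seed(1)
# Synthetic predictors: x2 is deliberately collinear with x1, x3 is independent
n  = 500
x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.3)
x3 = rnorm(n)
df = data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), R_j^2 from regressing predictor j on the rest
vif_manual = function(j, data) {
  r2 = summary(lm(reformulate(setdiff(names(data), j), response = j),
                  data = data))$r.squared
  1 / (1 - r2)
}

sapply(names(df), vif_manual, data = df)
# x1 and x2 show inflated VIFs; x3 stays near 1
```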

Transformation Analysis

What does the distribution of the room prices look like?

hist(london$realSum, 
     main = "Distribution of realSum", 
     xlab = "realSum", col = "steelblue")

Maybe this would look better if we took the logarithm. This is a typical transformation for prices that can vary over several orders of magnitude.

hist(log(london$realSum), prob = TRUE, 
     main = "Distribution of log(realSum)",
     xlab = "log(realSum)", col = "steelblue")
curve(dnorm(x, mean = mean(log(london$realSum)), sd = sd(log(london$realSum))), 
      col = "darkorange", add = TRUE, lwd = 3)

This does look much better: the distribution is far less skewed and approximately normal. Let us try fitting a model with a log transformation of the response.

Function to perform diagnostics on models with a log-transformed response

log_diag = function(model, name){
  # Back-transform fitted values to the original scale before computing RMSE
  # (r2, by contrast, remains on the log scale)
  resid = exp(fitted(model)) - london$realSum
  rmse = sqrt(sum(resid^2)/nrow(london))
  r2 = summary(model)$r.squared
  coefs = nrow(summary(model)$coeff)
  data.frame(model = name, rmse = rmse, r2 = r2, coefficients = coefs)
}

Fit a model with log transformation of response and all available predictors

full_log_model = lm(log(realSum) ~ ., data = london)
summary(full_log_model)
## 
## Call:
## lm(formula = log(realSum) ~ ., data = london)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86979 -0.19445 -0.02429  0.16259  1.61758 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 3.1417549  9.2756335   0.339  0.73485    
## room_type                  -0.5453170  0.0118349 -46.077  < 2e-16 ***
## person_capacity3            0.0989368  0.0168212   5.882 4.40e-09 ***
## person_capacity4            0.2042732  0.0152671  13.380  < 2e-16 ***
## person_capacity5            0.2409053  0.0298376   8.074 8.98e-16 ***
## person_capacity6            0.3037975  0.0270196  11.244  < 2e-16 ***
## host_is_superhost           0.0272539  0.0126747   2.150  0.03159 *  
## multi1                      0.0302205  0.0117160   2.579  0.00993 ** 
## biz1                        0.0411765  0.0126078   3.266  0.00110 ** 
## cleanliness_rating          0.0813609  0.0085470   9.519  < 2e-16 ***
## guest_satisfaction_overall  0.0018751  0.0009132   2.053  0.04011 *  
## bedrooms1                   0.1227469  0.0186095   6.596 4.79e-11 ***
## bedrooms2                   0.3496422  0.0249007  14.041  < 2e-16 ***
## bedrooms3                   0.4276488  0.0462242   9.252  < 2e-16 ***
## bedrooms4                   0.3303286  0.1106298   2.986  0.00285 ** 
## bedrooms5                   0.2309309  0.2872954   0.804  0.42156    
## dist                        0.0145217  0.0060826   2.387  0.01702 *  
## metro_dist                 -0.0496261  0.0111976  -4.432 9.60e-06 ***
## attr_index_norm             0.0048881  0.0019170   2.550  0.01081 *  
## rest_index_norm             0.0294681  0.0035620   8.273  < 2e-16 ***
## lng                        -0.6087828  0.0889799  -6.842 9.04e-12 ***
## lat                         0.0314020  0.1797240   0.175  0.86131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2865 on 3920 degrees of freedom
## Multiple R-squared:  0.6889, Adjusted R-squared:  0.6873 
## F-statistic: 413.4 on 21 and 3920 DF,  p-value: < 2.2e-16
log_diag(full_log_model, "full_log_model")
##            model     rmse        r2 coefficients
## 1 full_log_model 93.69018 0.6889451           22

The log transformation gives a better \(R^2\): the value increased from 0.6757450 for the full additive model to 0.6889451 with the log transform (though note the two \(R^2\) values are computed on different response scales, so the comparison is only indicative). The back-transformed \(RMSE\), however, is slightly higher (93.69 vs. 93.17).

Does the Box-Cox transformation work better than log transformation of the response?

Inverse of Box-Cox to display results

invBoxCox = function(x, lambda) {
  if (lambda == 0) exp(x) else (lambda * x + 1)^(1 / lambda)
}
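A quick self-contained round trip (restating both directions of the transform; the test values are arbitrary, chosen to span the observed realSum range) confirms the inverse:

```r
# Inverse Box-Cox, restated for a self-contained check
invBoxCox = function(x, lambda) {
  if (lambda == 0) exp(x) else (lambda * x + 1)^(1 / lambda)
}

# Forward Box-Cox transform for reference
boxCoxFwd = function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

y = c(54.33, 268.12, 12937.27)  # arbitrary positive values
lambda = 0.02020202

invBoxCox(boxCoxFwd(y, lambda), lambda)  # recovers y
```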

Perform Box-Cox transform

library(MASS)
bc = boxcox(full_additive_mod)

(lambda = bc$x[which.max(bc$y)])
## [1] 0.02020202
bc_model =  lm(((realSum^lambda-1)/lambda) ~ ., data = london)
resid = invBoxCox(fitted(bc_model), lambda) - london$realSum
rmse = sqrt(sum(resid^2)/nrow(london))
r2 = summary(bc_model)$r.squared
coefs = nrow(summary(bc_model)$coeff)
data.frame(model = "bc_model", rmse = rmse, r2 = r2, coefficients = coefs )
##      model     rmse        r2 coefficients
## 1 bc_model 93.59655 0.6895857           22

The Box-Cox transform did not help significantly: the estimated \(\lambda \approx 0.02\) is close to 0, which corresponds to the log transform.

We will now try transformations of other variables. Distance is likely to have a skewed distribution, and being very close to the city center probably matters more than differences between locations farther out.

hist(log(london$dist),
     main = "Distribution of log(dist)", 
     xlab = "log(dist)", col = "steelblue",
     prob = TRUE)

hist(sqrt(london$dist), 
     main = "Distribution of sqrt(dist)", 
     xlab = "sqrt(dist)", col = "steelblue", prob = TRUE)
curve(dnorm(x, mean = mean(sqrt(london$dist)), sd = sd(sqrt(london$dist))), 
      col = "darkorange", add = TRUE, lwd = 3)

A transformation of distance may allow for a better model. We will try a number of transformations of the numerical predictors.
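A comparison of this kind can be automated by looping over candidate transformations. The sketch below uses a small synthetic data set (a hypothetical stand-in, not the Airbnb data, and not the code that produced the table that follows) in which the response depends on the square root of a covariate, so that transformation should win:

```r
set.seed(1)
# Synthetic stand-in: the response depends on sqrt of the covariate
df = data.frame(x = runif(200, 1, 100))
df$y = 10 * sqrt(df$x) + rnorm(200)

transforms = list(identity = function(x) x,
                  square   = function(x) x^2,
                  sqrt     = sqrt,
                  log      = log,
                  inverse  = function(x) 1/x)

# In-sample RMSE of y ~ f(x) for each candidate transformation f
results = sapply(transforms, function(f) {
  fit = lm(y ~ f(x), data = df)
  sqrt(mean(resid(fit)^2))
})
round(results, 3)
# sqrt gives the lowest RMSE here by construction
```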

Summary of performance for transformation models

transform_results
##                               model      rmse        r2 coefficients
## 1                            dist^2  93.65850 0.6892697           23
## 2                            dist^3  93.68354 0.6890521           23
## 3                        sqrt(dist)  93.58824 0.6898447           23
## 4                         log(dist)  93.56241 0.6900572           23
## 5                            1/dist  93.52151 0.6903879           23
## 6                            poly 4  93.40063 0.6911002           25
## 7                      metro_dist^2  93.68063 0.6889889           23
## 8                      metro_dist^3  93.67733 0.6890096           23
## 9                  sqrt(metro_dist)  93.68713 0.6889615           23
## 10                  log(metro_dist)  93.68867 0.6889534           23
## 11                     1/metro_dist  93.69047 0.6889453           23
## 12             cleanliness_rating^2  93.69926 0.6896434           23
## 13             cleanliness_rating^3  93.69926 0.6896434           23
## 14         sqrt(cleanliness_rating)  93.69926 0.6896434           23
## 15          log(cleanliness_rating)  93.69926 0.6896434           23
## 16             1/cleanliness_rating  93.69926 0.6896434           23
## 17     guest_satisfaction_overall^2  93.42177 0.6912219           23
## 18     guest_satisfaction_overall^3  93.42298 0.6912350           23
## 19 sqrt(guest_satisfaction_overall)  93.42103 0.6911926           23
## 20  log(guest_satisfaction_overall)  93.42111 0.6911799           23
## 21     1/guest_satisfaction_overall  93.42180 0.6911502           23
## 22                attr_index_norm^2  93.26696 0.6909701           23
## 23                attr_index_norm^3  93.33142 0.6905552           23
## 24            sqrt(attr_index_norm)  93.19488 0.6915052           23
## 25             log(attr_index_norm)  93.19333 0.6915609           23
## 26                1/attr_index_norm  93.24204 0.6913636           23
## 27                rest_index_norm^2  93.10515 0.6934284           23
## 28                rest_index_norm^3  93.16684 0.6929168           23
## 29            sqrt(rest_index_norm)  93.07934 0.6937721           23
## 30             log(rest_index_norm)  93.10136 0.6936802           23
## 31                1/rest_index_norm  93.19633 0.6931338           23
## 32                            lat^2  93.45178 0.6896209           23
## 33                            lat^3  93.45172 0.6896211           23
## 34                        sqrt(lat)  93.69018 0.6889451           22
## 35                         log(lat)  93.69018 0.6889451           22
## 36                            1/lat  93.45194 0.6896202           23
## 37                            lng^2  93.48710 0.6894534           23
## 38                            lng^3  93.65829 0.6889698           23
## 39                        sqrt(lng) 226.11375 0.9272024           21
## 40                         log(lng) 226.19698 0.9256809           21
## 41                            1/lng  93.67125 0.6892571           23

The biggest improvement comes from sqrt(rest_index_norm), in terms of both \(R^2\) and \(RMSE\). The apparent jump for sqrt(lng) and log(lng) is an artifact: many London longitudes are negative, so those transforms produce NaNs, rows are silently dropped, and those fits are not comparable to the others.
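The table above was produced by looping over candidate transformations. A sketch of the dist portion of that loop, assuming the log_diag helper used throughout this report:

```r
# Each candidate model adds one transformed term to the full log model
candidates = c("I(dist^2)", "I(dist^3)", "sqrt(dist)", "log(dist)", "I(1/dist)")
transform_results = data.frame()
for (term in candidates) {
  form = as.formula(paste("log(realSum) ~ . +", term))
  transform_results = rbind(transform_results, log_diag(lm(form, data = london), term))
}
```

The loops for metro_dist, cleanliness_rating, and the remaining predictors follow the same pattern.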

Interaction Analysis

Next, fit a model with all pairwise interactions, then reduce the number of variables with backward selection. We start from the full log model.

Fit a full interaction model

full_interact = lm(log(realSum) ~ .^2, data = london)
interact_results = log_diag(full_interact, "full_interact")
interact_results
##           model     rmse        r2 coefficients
## 1 full_interact 87.85819 0.7242581          187

The full interaction model has a higher \(R^2\) than any of the transformation models above, though we will need to check for overfitting. Its \(RMSE\) is also about 5 units lower than that of the transformation models.

The full interaction model is quite large. Let us use AIC and BIC to reduce the number of predictors.
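Both reductions can be carried out with step(); a sketch, assuming the reduced models are named interact_aic and interact_bic:

```r
# Backward elimination from the full interaction model.
# step() uses AIC by default; setting k = log(n) gives BIC.
interact_aic = step(full_interact, direction = "backward", trace = 0)
interact_bic = step(full_interact, direction = "backward",
                    k = log(nrow(london)), trace = 0)
```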

AIC decreases the number of coefficients from 214 in the full interaction model to 113, with \(R^2\) falling from 0.7242581 to 0.7199981 and \(RMSE\) rising from 87.85819 to 88.8297. This is a small cost for removing roughly 100 coefficients.

Try BIC to get a smaller model

BIC decreases the number of coefficients from 214 in the full interaction model to 45, with \(R^2\) falling from 0.7242581 to 0.705297 and \(RMSE\) rising from 87.85819 to 91.23548. This removes well over 150 coefficients.

Putting It Together

Try a model with transformations and interactions. Start with the model from the BIC elimination above and add the sqrt(rest_index_norm) transformation.

Fit the combined model

combined_model =  lm(log(realSum) ~ room_type + person_capacity + multi + 
    biz + cleanliness_rating + guest_satisfaction_overall + bedrooms + 
    dist + metro_dist + attr_index_norm + rest_index_norm + lng + 
    lat + room_type:multi + room_type:cleanliness_rating + biz:lat + 
    guest_satisfaction_overall:rest_index_norm + dist:lng + dist:lat + 
    metro_dist:lng + metro_dist:lat + attr_index_norm:rest_index_norm + sqrt(rest_index_norm), data = london)

combined_results = log_diag(combined_model, "combined_model")
combined_results
##            model     rmse        r2 coefficients
## 1 combined_model 91.09935 0.7068179           31

Adding sqrt(rest_index_norm) does not meaningfully change model performance.

Let us remove predictors to reduce collinearity and simplify the model. We removed predictors that showed collinearity or had high p-values.

smaller_combined_model =  lm(log(realSum) ~ room_type + person_capacity + cleanliness_rating + guest_satisfaction_overall
                     + bedrooms + dist + attr_index_norm  + lng + 
                     + multi:guest_satisfaction_overall + biz:cleanliness_rating                     
                     + cleanliness_rating:guest_satisfaction_overall                        
                     + dist:rest_index_norm + metro_dist:attr_index_norm, data = london)

vif(smaller_combined_model)
##                                     room_type 
##                                      1.710470 
##                              person_capacity3 
##                                      1.112521 
##                              person_capacity4 
##                                      1.649925 
##                              person_capacity5 
##                                      1.231522 
##                              person_capacity6 
##                                      1.791708 
##                            cleanliness_rating 
##                                    177.315766 
##                    guest_satisfaction_overall 
##                                    177.658606 
##                                     bedrooms1 
##                                      2.862126 
##                                     bedrooms2 
##                                      3.191308 
##                                     bedrooms3 
##                                      1.466244 
##                                     bedrooms4 
##                                      1.041868 
##                                     bedrooms5 
##                                      1.005572 
##                                          dist 
##                                      5.772798 
##                               attr_index_norm 
##                                      5.633999 
##                                           lng 
##                                      1.174954 
##             guest_satisfaction_overall:multi1 
##                                      1.311323 
##                       cleanliness_rating:biz1 
##                                      1.645178 
## cleanliness_rating:guest_satisfaction_overall 
##                                    587.125838 
##                          dist:rest_index_norm 
##                                      1.251359 
##                    attr_index_norm:metro_dist 
##                                      1.234922
row = log_diag(smaller_combined_model, "smaller_combined_model")
row
##                    model     rmse        r2 coefficients
## 1 smaller_combined_model 93.62872 0.6900676           21
combined_results = rbind(combined_results, row)

Assumption Analysis

Function to perform diagnostic tests of LINE assumptions

assumpt = function(model, pcol = "gray", lcol = "dodgerblue", alpha = 0.05){
    par(mfrow=c(1,2)) 
    plot(fitted(model), resid(model), col = pcol, pch = 20,
         xlab = "Fitted", ylab = "Residuals", 
         main = paste("Fitted vs. Residuals for ", substitute(model), sep = ""))
    abline(h = 0, col = lcol, lwd = 2)
    qqnorm(resid(model), main = paste("Normal Q-Q Plot for ", substitute(model), sep = ""), col = pcol, pch = 20)
    qqline(resid(model), col = lcol, lwd = 2)
}
assumpt(full_additive_mod)

assumpt(full_interact)

assumpt(combined_model)

assumpt(smaller_combined_model)

assumpt(brute_force)

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
tab = data.frame(Model = "full_additive_mod", Shapiro = unname(shapiro.test(resid(full_additive_mod))$p.value), BP = unname(bptest(full_additive_mod)$p.value))
row = data.frame(Model = "full_interact", Shapiro = unname(shapiro.test(resid(full_interact))$p.value), BP = unname(bptest(full_interact)$p.value))
tab = rbind(tab, row)
row = data.frame(Model = "combined_model", Shapiro = unname(shapiro.test(resid(combined_model))$p.value), BP = unname(bptest(combined_model)$p.value))
tab = rbind(tab, row)
row = data.frame(Model = "smaller_combined_model", Shapiro = unname(shapiro.test(resid(smaller_combined_model))$p.value), 
                                BP = unname(bptest(smaller_combined_model)$p.value))
tab = rbind(tab, row)
row = data.frame(Model = "brute_force", Shapiro = unname(shapiro.test(resid(brute_force))$p.value), BP = unname(bptest(brute_force)$p.value))
tab = rbind(tab, row)
kable(tab)
Model Shapiro BP
full_additive_mod 0 0.0e+00
full_interact 0 3.8e-06
combined_model 0 0.0e+00
smaller_combined_model 0 0.0e+00
brute_force 0 0.0e+00

Removal of Unusual Values

From the analysis above, smaller_combined_model appears to perform best. We will now remove unusual observations.

plot(smaller_combined_model)
## Warning: not plotting observations with leverage one:
##   161

unusual_index = cooks.distance(smaller_combined_model)>(4/nrow(london))
london_no_unusual = london[!unusual_index,]
london_no_unusual = na.omit(london_no_unusual)
nrow(london) - nrow(london_no_unusual)
## [1] 173
(nrow(london) - nrow(london_no_unusual))/nrow(london)
## [1] 0.04388635
max(london_no_unusual$realSum)
## [1] 833.0393

Removing points with a Cook's distance greater than 4/n eliminates 173 points, or about 4.4% of the total number of rows.

library(lmtest)
smaller_combined_no_unusual =  lm(log(realSum) ~ room_type + person_capacity + cleanliness_rating + guest_satisfaction_overall
                     + bedrooms + dist + attr_index_norm  + lng + 
                     + multi:guest_satisfaction_overall + biz:cleanliness_rating                     
                     + cleanliness_rating:guest_satisfaction_overall                        
                     + dist:rest_index_norm + metro_dist:attr_index_norm, data = london_no_unusual)
log_diag(smaller_combined_no_unusual, "smaller_combined_no_unusual")
## Warning in exp(fitted(model)) - london$realSum: longer object length is not a
## multiple of shorter object length
##                         model     rmse        r2 coefficients
## 1 smaller_combined_no_unusual 211.0687 0.7569453           20

The warning indicates that log_diag computes residuals against the full london data rather than london_no_unusual, so this RMSE is inflated and not directly comparable to the earlier values.

assumpt(smaller_combined_no_unusual)

plot(smaller_combined_no_unusual)
## Warning: not plotting observations with leverage one:
##   1239

shapiro.test(resid(smaller_combined_no_unusual)[1:4980])
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(smaller_combined_no_unusual)[1:4980]
## W = 0.99516, p-value = 7.9e-10
bptest(smaller_combined_no_unusual)
## 
##  studentized Breusch-Pagan test
## 
## data:  smaller_combined_no_unusual
## BP = 129.68, df = 19, p-value < 2.2e-16

What if we exclude points with large leverage?

leverage_index = hatvalues(smaller_combined_no_unusual)>(2*mean(hatvalues(smaller_combined_no_unusual)))
sum(leverage_index)
## [1] 211
fewer = london_no_unusual[!leverage_index,]
smaller_b =  lm(log(realSum) ~ room_type + person_capacity + cleanliness_rating + guest_satisfaction_overall
                     + bedrooms + dist + attr_index_norm  + lng + 
                     + multi:guest_satisfaction_overall + biz:cleanliness_rating                     
                     + cleanliness_rating:guest_satisfaction_overall                        
                     + dist:rest_index_norm + metro_dist:attr_index_norm, data = fewer)
log_diag(smaller_b, "smaller_combined_no_unusual_b")
## Warning in exp(fitted(model)) - london$realSum: longer object length is not a
## multiple of shorter object length
##                           model     rmse        r2 coefficients
## 1 smaller_combined_no_unusual_b 211.7355 0.6940605           21
assumpt(smaller_b)

plot(smaller_b)
## Warning: not plotting observations with leverage one:
##   154

shapiro.test(resid(smaller_b)[1:4980])
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(smaller_b)[1:4980]
## W = 0.97993, p-value < 2.2e-16
library(MASS)
smaller_combined_bc_nu =  lm(((realSum^lambda-1)/lambda) ~ room_type + person_capacity + cleanliness_rating + guest_satisfaction_overall
                     + bedrooms + dist + attr_index_norm  + lng + 
                     + multi:guest_satisfaction_overall + biz:cleanliness_rating                     
                     + cleanliness_rating:guest_satisfaction_overall                        
                     + dist:rest_index_norm + metro_dist:attr_index_norm, data = london_no_unusual)
assumpt(smaller_combined_bc_nu)
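The lambda used above comes from an earlier Box-Cox step. A typical estimation, sketched here with MASS::boxcox (the exact predictor set used for the profile is an assumption):

```r
library(MASS)
# Profile likelihood of the Box-Cox parameter for the untransformed response
bc = boxcox(lm(realSum ~ room_type + person_capacity + bedrooms + dist,
               data = london_no_unusual),
            lambda = seq(-1, 1, by = 0.05), plotit = FALSE)
lambda = bc$x[which.max(bc$y)]
```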

hist(((london$realSum^lambda-1)/lambda), prob = TRUE)
curve(dnorm(x, mean = mean(((london$realSum^lambda-1)/lambda)), sd = sd(((london$realSum^lambda-1)/lambda))), 
      col = "darkorange", add = TRUE, lwd = 3)
residuals = unname(resid(smaller_combined_bc_nu))[1:4998]
shapiro.test(residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.9948, p-value = 2.485e-10
resid = invBoxCox(fitted(smaller_combined_bc_nu), lambda) - london$realSum
## Warning in invBoxCox(fitted(smaller_combined_bc_nu), lambda) - london$realSum:
## longer object length is not a multiple of shorter object length
rmse = sqrt(sum(resid^2)/nrow(london))
r2 = summary(smaller_combined_bc_nu)$r.squared
coefs = nrow(summary(smaller_combined_bc_nu)$coeff)
data.frame(model = "smaller_combined_bc_nu", rmse = rmse, r2 = r2, coefficients = coefs )
##                    model     rmse        r2 coefficients
## 1 smaller_combined_bc_nu 210.9328 0.7578903           20

Model Evaluation

sample = sample(c(TRUE, FALSE), nrow(london_no_unusual), replace=TRUE, prob=c(0.7,0.3))
london_train  = london_no_unusual[sample, ]
london_test   = london_no_unusual[!sample, ]
log_predict_diag = function(true_data, fit_data, model, dataset){
  resid = exp(unname(fit_data)) - unname(true_data)

  rmse = sqrt(sum(resid^2)/length(fit_data))
  data.frame(model = model, dataset = dataset, rmse = rmse)
}
smaller_combined_no_unusual_train =  lm(log(realSum) ~ room_type + person_capacity + cleanliness_rating + guest_satisfaction_overall
                     + bedrooms + dist + attr_index_norm  + lng + 
                     + multi:guest_satisfaction_overall + biz:cleanliness_rating                     
                     + cleanliness_rating:guest_satisfaction_overall                        
                     + dist:rest_index_norm + metro_dist:attr_index_norm, data = london_train)

test_fit = predict(smaller_combined_no_unusual_train, london_test)
eval_results = log_predict_diag(london_test$realSum, test_fit, "smaller_combined_no_unusual_test", "test")
eval_results
##                              model dataset     rmse
## 1 smaller_combined_no_unusual_test    test 76.02098
train_fit = predict(smaller_combined_no_unusual_train, london_train)
row = log_predict_diag(london_train$realSum, train_fit, "smaller_combined_no_unusual_train", "training")
row
##                               model  dataset     rmse
## 1 smaller_combined_no_unusual_train training 77.23327
eval_results = rbind(eval_results, row)


plot(exp(unname(test_fit)) ~ unname(london_test$realSum))
abline(0, 1)

full_additive_no_unusual_train =  lm(log(realSum) ~ ., data = london_train)

test_fit = predict(full_additive_no_unusual_train, london_test)
row = log_predict_diag(london_test$realSum, test_fit, "full_additive_no_unusual_test", "test")
row
##                           model dataset     rmse
## 1 full_additive_no_unusual_test    test 76.46597
eval_results = rbind(eval_results, row)

train_fit = predict(full_additive_no_unusual_train, london_train)
row = log_predict_diag(london_train$realSum, train_fit, "full_additive_no_unusual_train", "training")
row
##                            model  dataset     rmse
## 1 full_additive_no_unusual_train training 77.13113
eval_results = rbind(eval_results, row)
full_interact_no_unusual_train =  lm(log(realSum) ~ .^2, data = london_train)

test_fit = predict(full_interact_no_unusual_train, london_test)
## Warning in predict.lm(full_interact_no_unusual_train, london_test): prediction
## from rank-deficient fit; attr(*, "non-estim") has doubtful cases
row = log_predict_diag(london_test$realSum, test_fit, "full_interact_no_unusual_test", "test")
row
##                           model dataset    rmse
## 1 full_interact_no_unusual_test    test 77.3869
eval_results = rbind(eval_results, row)

train_fit = predict(full_interact_no_unusual_train, london_train)
## Warning in predict.lm(full_interact_no_unusual_train, london_train): prediction
## from rank-deficient fit; attr(*, "non-estim") has doubtful cases
row = log_predict_diag(london_train$realSum, train_fit, "full_interact_no_unusual_train", "training")
row
##                            model  dataset     rmse
## 1 full_interact_no_unusual_train training 72.77005
eval_results = rbind(eval_results, row)
kable(eval_results)
model dataset rmse
smaller_combined_no_unusual_test test 76.02098
smaller_combined_no_unusual_train training 77.23327
full_additive_no_unusual_test test 76.46597
full_additive_no_unusual_train training 77.13113
full_interact_no_unusual_test test 77.38690
full_interact_no_unusual_train training 72.77005

As we can see from the plot of predicted versus true values for smaller_combined_no_unusual_train, the model falls down on higher-priced rooms, underestimating their true values. Perhaps some intangible quality of these listings cannot be captured by the predictors. What happens if we exclude listings with realSum greater than 1000?

london_no_unusual_no_outliers = london_no_unusual[!(london_no_unusual$realSum >1000),]
hA = hist(log(london_no_unusual_no_outliers$realSum), plot = FALSE)
hB = hist(log(london_no_unusual$realSum), plot = FALSE)
plot(hB, col = "dodgerblue")
plot(hA, col = "darkorange", add = TRUE)

smaller_combined_no_unusual_no_outliers_mod =  lm(log(realSum) ~ room_type + person_capacity + cleanliness_rating + guest_satisfaction_overall
                     + bedrooms + dist + log(dist) + attr_index_norm + lng + 
                     + multi:guest_satisfaction_overall + biz:cleanliness_rating                     
                     + cleanliness_rating:guest_satisfaction_overall                       
                     + dist:rest_index_norm + metro_dist:attr_index_norm, data = london_no_unusual_no_outliers)

plot(smaller_combined_no_unusual_no_outliers_mod)
## Warning: not plotting observations with leverage one:
##   1239

test_fit = predict(smaller_combined_no_unusual_no_outliers_mod, london_no_unusual_no_outliers)

plot(exp(unname(test_fit)) ~ unname(london_no_unusual_no_outliers$realSum))
abline(0, 1)

resid = exp(fitted(smaller_combined_no_unusual_no_outliers_mod)) - london_no_unusual_no_outliers$realSum
rmse = sqrt(sum(resid^2)/nrow(london_no_unusual_no_outliers))
r2 = summary(smaller_combined_no_unusual_no_outliers_mod)$r.squared
coefs = nrow(summary(smaller_combined_no_unusual_no_outliers_mod)$coeff)
data.frame(model = "smaller_combined_no_unusual_no_outliers_mod", rmse = rmse, r2 = r2, coefficients = coefs )
##                                         model     rmse        r2 coefficients
## 1 smaller_combined_no_unusual_no_outliers_mod 76.66374 0.7580138           21
shapiro.test(resid(smaller_combined_no_unusual_no_outliers_mod)[1:4999])
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(smaller_combined_no_unusual_no_outliers_mod)[1:4999]
## W = 0.99521, p-value = 9.498e-10
bptest(smaller_combined_no_unusual_no_outliers_mod)
## 
##  studentized Breusch-Pagan test
## 
## data:  smaller_combined_no_unusual_no_outliers_mod
## BP = 131.5, df = 20, p-value < 2.2e-16
nrow(london_no_unusual_no_outliers)
## [1] 3769

Results

mean(london$realSum)
## [1] 309.1503

The simple fully additive model, full_additive_mod, produced a high RMSE of 372.2035 and an \(R^2\) of 0.2768903 when predicting the Airbnb rental price (realSum). This is quite poor, given that the mean realSum is 309.1503. We tried a number of methods to improve this model, including removing unusual values, AIC, and BIC, without much improvement. We also tried treating the integer predictors both as factors and as plain numeric values; this did not help. The pairs plot shows nonlinear relationships between many of the variables, suggesting that predictor interactions and transformations may be helpful.

By far the most significant improvement in our modelling came from the log transform of realSum. The resulting full_log_model had a much improved \(R^2\) of 0.6877577, though RMSE remained high at 371.2015.

plot(fitted(full_additive_mod), london$realSum)
abline(0, 1)

plot(exp(fitted(full_log_model)), london$realSum)
abline(0, 1)

As you can see from the plots above, for both the full_additive_mod and full_log_model, there are still a large number of outliers.

We tried a number of other transformations, none of which had much impact on model fit. Next, we tried the full interaction model, which improved \(R^2\) to 0.7335292 while RMSE stayed about the same. This model also showed evidence of overfitting, as RMSE rose significantly when it was used for prediction. In the end, we used BIC to reduce the full interaction model to the more tractable smaller_combined_model, which became the mainstay of the rest of our analysis.

Our suspicion grew that unusual observations were driving the large RMSE values. This can be seen in the diagnostic plots below.

smaller_combined_model =  lm(log(realSum) ~ room_type + person_capacity + cleanliness_rating + guest_satisfaction_overall
                     + bedrooms + dist + attr_index_norm  + lng + 
                     + multi:guest_satisfaction_overall + biz:cleanliness_rating                     
                     + cleanliness_rating:guest_satisfaction_overall                        
                     + dist:rest_index_norm + metro_dist:attr_index_norm, data = london)
plot(smaller_combined_model)
## Warning: not plotting observations with leverage one:
##   161

4/nrow(london)
## [1] 0.001014713

In the end, we eliminated 173 rows of data with a Cook's distance over 4/n. This improved model fit significantly, and RMSE decreased substantially for many of our models.

smaller_combined_no_unusual_train =  lm(log(realSum) ~ room_type + person_capacity + cleanliness_rating + guest_satisfaction_overall
                     + bedrooms + dist + attr_index_norm  + lng + 
                     + multi:guest_satisfaction_overall + biz:cleanliness_rating                     
                     + cleanliness_rating:guest_satisfaction_overall                        
                     + dist:rest_index_norm + metro_dist:attr_index_norm, data = london_train)

test_fit = predict(smaller_combined_no_unusual_train, london_test)
plot(exp(unname(test_fit)) ~ unname(london_test$realSum))
abline(0, 1)

One difficulty we had was assuring ourselves that our model did not violate the assumptions of constant variance and normality. Our best model, smaller_combined_no_unusual, did have a fairly good Q-Q plot and Fitted vs. Residuals plot.

assumpt(smaller_combined_no_unusual)

This model did not pass the Shapiro-Wilk test, but with large sample sizes that test detects even trivial departures from normality, leading to near-automatic rejection of the null hypothesis of normality.
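This sensitivity is easy to demonstrate with simulated data that departs only slightly from normality (a t-distribution with 10 degrees of freedom); the exact p-values depend on the random seed:

```r
set.seed(1)
# Same mild departure from normality, two sample sizes
shapiro.test(rt(100, df = 10))$p.value   # often well above 0.05
shapiro.test(rt(5000, df = 10))$p.value  # usually far below 0.05
```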

We also tried a Box-Cox transformation, which did not improve much on the log transformation of the response.

In the end, our model with predictors room_type + person_capacity + cleanliness_rating + guest_satisfaction_overall + bedrooms + dist + attr_index_norm + lng + multi:guest_satisfaction_overall + biz:cleanliness_rating + cleanliness_rating:guest_satisfaction_overall + dist:rest_index_norm + metro_dist:attr_index_norm seemed optimal in terms of coming close to satisfying the LINE assumptions, providing low RMSE, and being interpretable. The main predictors, including room_type, person_capacity, cleanliness_rating, guest_satisfaction_overall, and bedrooms, are all things one would expect to influence the cost of a room, and proximity to the city center and attractions makes a room more desirable. The interactions make sense as well, implying synergies between predictors such as distance to the metro and attractions.

Discussion

The Airbnb dataset we chose for our project contains a large number of predictors bearing on a fundamentally subjective quantity: how much a guest is willing to pay for a room. We constructed an interpretable model with a reasonable number of predictors that has low RMSE and a respectable \(R^2\), and that does not overfit the data. On graphical testing, the model appeared to adhere to the LINE assumptions.

There were a number of challenges in our analysis. Determining the correct data transformations and interactions was more art than science. The most important manipulations in our analysis were the log transformation of the response and elimination of the unusual observations.

In the end, the biggest challenge of our model was to predict higher priced rentals. Maybe there are qualities of these listings that the data cannot capture.